1 Setup

  • Check your dplyr package is up-to-date by typing packageVersion("dplyr"). If the current installed version is less than 1.0, then update by typing update.packages("dplyr"). You may need to restart R to make it work.
ifelse(packageVersion("dplyr") >= 1,
  "The installed version of dplyr package is greater than or equal to 1.0.0", update.packages("dplyr")
)
## [1] "The installed version of dplyr package is greater than or equal to 1.0.0"
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(
  tidyverse, # the tidyverse framework
  here, # computational reproducibility
  gapminder, # toy data
  nycflights13, # for exercise
  ggthemes, # additional themes
  ggrepel, # annotating ggplot with text 
  patchwork, # arranging ggplots
  broom # tidying model outputs
)

2 Tidying (tidyr)

2.1 Reshaping

Signs of messy datasets

    1. Column headers are values, not variable names.
    1. Multiple variables are not stored in one column.
    1. Variables are stored in both rows and columns.
    1. Multiple types of observational units are stored in the same table.
    1. A single observational unit is stored in multiple tables.

Let’s take a look at the cases of untidy data.

Messy Data Case 1 (Source: R for Data Science)

  • Make It Longer

    Col1 Col2 Col3

Challenge: Why is this data not tidy?

table4a
  • Let’s pivot (rotate by 90 degrees).

Concept map for pivoting. By Florian Schmoll, Monica Alonso.

  • pivot_longer() increases the number of rows (longer) and decreases the number of columns. The inverse function is pivot_wider(). These functions improve the usability of gather() and spread().

What pivot_longer() does (Source: https://www.storybench.org)

Concept map for pipe operator. By Jeroen Janssens, Monica Alonso.

  • The pipe operator %>% originally comes from the magrittr package. The idea behind the pipe operator is similar to what we learned about chaining functions in high school. f: B -> C and g: A -> B can be expressed as \(f(g(x))\). The pipe operator chains operations. When you read pipe operator, read as “and then” (Wickham’s recommendation). The keyboard shortcut is ctrl + shift + M. The key idea here is not creating temporary variables and focusing on verbs (functions). We’ll learn more about this functional programming paradigm later on.
table4a 
# Old way, less intuitive
table4a %>%
  gather(
    key = "year", # Current column names
    value = "cases", # The values matched to cases
    c("1999", "2000")
  ) # Selected columns
# New way, more intuitive
table4a %>%
  pivot_longer(
    cols = c("1999", "2000"), # Selected columns
    names_to = "year", # Shorter columns (the columns going to be in one column called year)
    values_to = "cases"
  ) # Longer rows (the values are going to be in a separate column called named cases)
  • There’s another problem, did you catch it?

  • The data type of year variable should be numeric not character. By default, pivot_longer() transforms uninformative columns to character.

  • You can fix this problem by using names_transform argument.

table4a %>%
  pivot_longer(
    cols = c("1999", "2000"), # Put two columns together
    names_to = "year", # Shorter columns (the columns going to be in one column called year)
    values_to = "cases", # Longer rows (the values are going to be in a separate column called named cases)
    names_transform = list(year = readr::parse_number)
  ) # Transform the variable

Additional tips

parse_number() also keeps only numeric information in a variable.

parse_number("reply1994")
## [1] 1994

A flat file (e.g., CSV) is a rectangular shaped combination of strings. Parsing determines the type of each column and turns into a vector of a more specific type. Tidyverse has parse_ functions (from readr package) that are flexible and fast (e.g., parse_integer(), parse_double(), parse_logical(), parse_datetime(), parse_date(), parse_time(), parse_factor(), etc).

  • Let’s do another practice.

Challenge

  1. Why is this data not tidy? (This exercise comes from pivot function vigenette.) Too long or too wide?
billboard
  1. How can you fix it? Which pivot?
# Old way
billboard %>%
  gather(
    key = "week",
    value = "rank",
    starts_with("wk")
  ) %>% # Use regular expressions
  drop_na() # Drop NAs
  • Note that pivot_longer() is more versatile than gather().
# New way
billboard %>%
  pivot_longer(
    cols = starts_with("wk"), # Use regular expressions
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE # Drop NAs
  )
  • Make It Wider

  • Why is this data not tidy?

table2
  • Each observation is spread across two rows.

  • How can you fix it?: pivot_wider().

Two differences between pivot_longer() and pivot_wider()

  • In pivot_longer(), the arguments are named names_to and values_to (to).

  • In pivot_wider(), this pattern is opposite. The arguments are named names_from and values_from (from).

  • The number of required arguments for pivot_longer() is 3 (col, names_to, values_to).

  • The number of required arguments for pivot_wider() is 2 (names_from, values_from).

What pivot_wider() does (Source: https://www.storybench.org)

# Old way
table2 %>%
  spread(
    key = type,
    value = count
  )
# New way
table2 %>%
  pivot_wider(
    names_from = type, # first
    values_from = count # second
  )

Sometimes, a consultee came to me and asked: “I don’t have missing values in my original dataframe. Then R said that I have missing values after I’ve done some data transformations. What happened?”

Here’s an answer.

R defines missing values in two ways.

  • Implicit missing values: simply not present in the data.

  • Explicit missing values: flagged with NA

Challenge

The example comes from R for Data Science.

stocks <- tibble(
  year = c(2019, 2019, 2019, 2020, 2020, 2020),
  qtr = c(1, 2, 3, 2, 3, 4),
  return = c(1, 2, 3, NA, 2, 3)
)
stocks
  • Where is the explicit missing value?

  • Does stocks have implicit missing values?

# implicit missing values become explicit
stocks %>%
  pivot_wider(
    names_from = year,
    values_from = return
  )

Challenge

  • This exercise comes from pivot function vigenette.

  • Could you make station a series of dummy variables using pivot_wider()?

fish_encounters
  1. Which pivot should you use?

  2. Are there explicit missing values?

  3. How could you turn these NAs into 0s? Check values_fill argument in the pivot_wider() function.

  • Separate

Messy Data Case 2 (Source: R for Data Science)

# Toy example
df <- data.frame(x = c(NA, "Dad.apple", "Mom.orange", "Daughter.banana"))
df
# Separate
df %>%
  separate(x, into = c("Name", "Preferred_fruit"))
# Don't need the first variable
df %>%
  separate(x, into = c(NA, "Preferred_fruit"))

Practice

table3
  • Note sep argument. You can specify how to separate joined values.
table3 %>%
  separate(rate,
    into = c("cases", "population"),
    sep = "/"
  )
  • Note convert argument. You can specify whether automatically convert the new values or not.
table3 %>%
  separate(rate,
    into = c("cases", "population"),
    sep = "/",
    convert = TRUE
  ) # cases and population become integers
  • Unite

pivot_longer() <-> pivot_wider()

separate() <-> unite()

# Create a toy example
df <- data.frame(
  name = c("Jae", "Sun", "Jane", NA),
  birthmonth = c("April", "April", "June", NA)
)
# Include missing values
df %>% unite(
  "contact",
  c("name", "birthmonth")
)
# Do not include missing values
df %>% unite("contact",
  c("name", "birthmonth"),
  na.rm = TRUE
)

2.2 Filling

This is a relatively less-known function of the tidyr package. I found this function super useful to complete time-series data. For instance, how can you replace NA in the following example (this use case is drawn from the tidyr package vignette.)?

# Example 
stock <- tibble::tribble(
  ~ quarter, ~ year, ~stock_price, 
  "Q1", 2000, 10000, 
  "Q2", NA, 10001, # Replace NA with 2000  
  "Q3", NA, 10002, # Replace NA with 2000 
  "Q4", NA, 10003, # Replace NA with 2000 
  "Q1", 2001, 10004, 
  "Q2", NA, 10005, # Replace NA with 2001 
  "Q3", NA, 10006, # Replace NA with 2001 
  "Q4", NA, 10007, # Replace NA with 2001
)
fill(stock, year)

Let’s take a slightly more complex example.

# Example 
yelp_rate <- tibble::tribble(
  ~ neighborhood, ~restraurant_type, ~popularity_rate, 
  "N1", "Chinese", 5, 
  "N2", NA, 4,   
  "N3", NA, 3,  
  "N4", NA, 2,  
  "N1", "Indian", 1, 
  "N2", NA, 2,  
  "N3", NA, 3,  
  "N4", NA, 4, 
  "N1", "Mexican", 5
)
fill(yelp_rate, restraurant_type) # default is direction = .down 
fill(yelp_rate, restraurant_type, .direction = "up") 

3 Manipulating (dplyr)

Concept map for dplyr. By Monica Alonso, Greg Wilson.

dplyr is better than the base R approaches to data processing:

  • fast to run (due to the C++ backed) and intuitive to type
  • works well with tidy data and databases (thanks to dbplyr)

3.1 Rearranging

  • Arrange

  • Order rows

dplyr::arrange(mtcars, mpg) # Low to High (default)
dplyr::arrange(mtcars, desc(mpg)) # High to Row
  • Rename

  • Rename columns

df <- tibble(y = c(2011, 2012, 2013))
df %>%
  rename(
    Year = # NEW name
    y
  ) # OLD name

3.2 Subset observations (rows)

  • Choose row by logical condition

  • Single condition

starwars %>%
  filter(gender == "feminine") %>%
  arrange(desc(height))

The following filtering example was inspired by the suzanbert’s dplyr blog post.

  • Multiple conditions (numeric)
# First example
starwars %>%
  filter(height < 180, height > 160) %>%
  nrow()
## [1] 24
# Same as above
starwars %>%
  filter(height < 180 & height > 160) %>%
  nrow()
## [1] 24
# Not same as above
starwars %>%
  filter(height < 180 | height > 160) %>%
  nrow()
## [1] 81

Challenge

  1. Use filter(between()) to find characters whose heights are between 180 and 160 and (2) count the number of these observations.
  • Minimum reproducible example
df <- tibble(
  heights = c(160:180),
  char = rep("none", length(c(160:180)))
)
df %>%
  filter(between(heights, 161, 179))
  • Multiple conditions (character)
# Filter names include ars; `grepl` is a base R function
starwars %>%
  filter(grepl("ars", tolower(name)))
# Or, if you prefer dplyr way
starwars %>%
  filter(str_detect(tolower(name), "ars"))
# Filter brown and black hair_color
starwars %>%
  filter(hair_color %in% c("black", "brown"))

Challenge

Use str_detect() to find characters whose names include “Han”.

  • Choose row by position (row index)
starwars %>%
  arrange(desc(height)) %>%
  slice(1:6)
  • Sample by a fraction
# For reproducibility
set.seed(1234)
# Old way
starwars %>%
  sample_frac(0.10,
    replace = FALSE
  ) # Without replacement
# New way
starwars %>%
  slice_sample(
    prop = 0.10,
    replace = FALSE
  )
  • Sample by number
# Old way
starwars %>%
  sample_n(20,
    replace = FALSE
  ) # Without replacement
# New way
starwars %>%
  slice_sample(
    n = 20,
    replace = FALSE
  ) # Without replacement
  • Top 10 rows orderd by height
# Old way
starwars %>%
  top_n(10, height)
# New way
starwars %>%
  slice_max(height, n = 10) # Variable first, Argument second

3.3 Subset variables (columns)

names(msleep)
##  [1] "name"         "genus"        "vore"         "order"        "conservation"
##  [6] "sleep_total"  "sleep_rem"    "sleep_cycle"  "awake"        "brainwt"     
## [11] "bodywt"
  • Select only numeric columns
# Only numeric
msleep %>%
  dplyr::select(where(is.numeric))

Challenge

Use select(where()) to find only non-numeric columns

  • Select the columns that include “sleep” in their names
msleep %>%
  dplyr::select(contains("sleep"))
  • Select the columns that include either “sleep” or “wt” in their names

  • Basic R way

grepl is one of the R base pattern matching functions.

msleep[grepl("sleep|wt", names(msleep))]

Challenge

Use select(match()) to find columns whose names include either “sleep” or “wt”.

  • Select the columns that start with “b”
msleep %>%
  dplyr::select(starts_with("b"))
  • Select the columns that end with “wt”
msleep %>%
  dplyr::select(ends_with("wt"))
  • Select the columns using both beginning and end string patterns

The key idea is you can use Boolean operators (!, &, |)to combine different string pattern matching statements.

msleep %>%
  dplyr::select(starts_with("b") & ends_with("wt"))
  • Select the order and move it before everything
# By specifying a column
msleep %>%
  dplyr::select(order, everything())
  • Select variables from a character vector.
msleep %>%
  dplyr::select(any_of(c("name", "order"))) %>%
  colnames()
## [1] "name"  "order"
  • Select the variables named in character + number pattern
msleep$week8 <- NA
msleep$week12 <- NA
msleep$week_extra <- 0
msleep %>%
  dplyr::select(num_range("week", c(1:12)))

Additional tips

msleep data has nicely cleaned column names. But real-world data are usually messier. The janitor package is useful to fix this kind of problem.

messy_df <- tibble::tribble(~"ColNum1", ~"COLNUM2", ~ "COL & NUM3",
                            1, 2, 3)
messy_df
pacman::p_load(janitor)
janitor::clean_names(messy_df) 

janitor::tabyl() is helpful for doing crosstabulation and a nice alternative to table() function.

# Frequency table; The default output class is table 
table(gapminder$country)
## 
##              Afghanistan                  Albania                  Algeria 
##                       12                       12                       12 
##                   Angola                Argentina                Australia 
##                       12                       12                       12 
##                  Austria                  Bahrain               Bangladesh 
##                       12                       12                       12 
##                  Belgium                    Benin                  Bolivia 
##                       12                       12                       12 
##   Bosnia and Herzegovina                 Botswana                   Brazil 
##                       12                       12                       12 
##                 Bulgaria             Burkina Faso                  Burundi 
##                       12                       12                       12 
##                 Cambodia                 Cameroon                   Canada 
##                       12                       12                       12 
## Central African Republic                     Chad                    Chile 
##                       12                       12                       12 
##                    China                 Colombia                  Comoros 
##                       12                       12                       12 
##         Congo, Dem. Rep.              Congo, Rep.               Costa Rica 
##                       12                       12                       12 
##            Cote d'Ivoire                  Croatia                     Cuba 
##                       12                       12                       12 
##           Czech Republic                  Denmark                 Djibouti 
##                       12                       12                       12 
##       Dominican Republic                  Ecuador                    Egypt 
##                       12                       12                       12 
##              El Salvador        Equatorial Guinea                  Eritrea 
##                       12                       12                       12 
##                 Ethiopia                  Finland                   France 
##                       12                       12                       12 
##                    Gabon                   Gambia                  Germany 
##                       12                       12                       12 
##                    Ghana                   Greece                Guatemala 
##                       12                       12                       12 
##                   Guinea            Guinea-Bissau                    Haiti 
##                       12                       12                       12 
##                 Honduras         Hong Kong, China                  Hungary 
##                       12                       12                       12 
##                  Iceland                    India                Indonesia 
##                       12                       12                       12 
##                     Iran                     Iraq                  Ireland 
##                       12                       12                       12 
##                   Israel                    Italy                  Jamaica 
##                       12                       12                       12 
##                    Japan                   Jordan                    Kenya 
##                       12                       12                       12 
##         Korea, Dem. Rep.              Korea, Rep.                   Kuwait 
##                       12                       12                       12 
##                  Lebanon                  Lesotho                  Liberia 
##                       12                       12                       12 
##                    Libya               Madagascar                   Malawi 
##                       12                       12                       12 
##                 Malaysia                     Mali               Mauritania 
##                       12                       12                       12 
##                Mauritius                   Mexico                 Mongolia 
##                       12                       12                       12 
##               Montenegro                  Morocco               Mozambique 
##                       12                       12                       12 
##                  Myanmar                  Namibia                    Nepal 
##                       12                       12                       12 
##              Netherlands              New Zealand                Nicaragua 
##                       12                       12                       12 
##                    Niger                  Nigeria                   Norway 
##                       12                       12                       12 
##                     Oman                 Pakistan                   Panama 
##                       12                       12                       12 
##                 Paraguay                     Peru              Philippines 
##                       12                       12                       12 
##                   Poland                 Portugal              Puerto Rico 
##                       12                       12                       12 
##                  Reunion                  Romania                   Rwanda 
##                       12                       12                       12 
##    Sao Tome and Principe             Saudi Arabia                  Senegal 
##                       12                       12                       12 
##                   Serbia             Sierra Leone                Singapore 
##                       12                       12                       12 
##          Slovak Republic                 Slovenia                  Somalia 
##                       12                       12                       12 
##             South Africa                    Spain                Sri Lanka 
##                       12                       12                       12 
##                    Sudan                Swaziland                   Sweden 
##                       12                       12                       12 
##              Switzerland                    Syria                   Taiwan 
##                       12                       12                       12 
##                 Tanzania                 Thailand                     Togo 
##                       12                       12                       12 
##      Trinidad and Tobago                  Tunisia                   Turkey 
##                       12                       12                       12 
##                   Uganda           United Kingdom            United States 
##                       12                       12                       12 
##                  Uruguay                Venezuela                  Vietnam 
##                       12                       12                       12 
##       West Bank and Gaza              Yemen, Rep.                   Zambia 
##                       12                       12                       12 
##                 Zimbabwe 
##                       12
# Frequency table (unique value, n, percentage)
janitor::tabyl(gapminder$country)
# If you want to add percentage ... 
gapminder %>%
  tabyl(country) %>%
  adorn_pct_formatting(digits = 0, affix_sign = TRUE)

3.4 Create variables

3.5 Change values using conditions

You can think of case_when() (multiple conditions) as an extended version of ifelse() (binary conditions).

mtcars <- mtcars %>%
  mutate(cyl_dummy = case_when(cyl > median(cyl) ~ "High", # if condition
                               cyl < median(cyl) ~ "Low", # else if condition 
                               TRUE ~ 'Median')) # else condition 
mtcars %>% pull(cyl_dummy)
##  [1] "Median" "Median" "Low"    "Median" "High"   "Median" "High"   "Low"   
##  [9] "Low"    "Median" "Median" "High"   "High"   "High"   "High"   "High"  
## [17] "High"   "Low"    "Low"    "Low"    "Low"    "High"   "High"   "High"  
## [25] "High"   "Low"    "Low"    "Low"    "High"   "Median" "High"   "Low"

3.6 Change values manually

mtcars %>%
  mutate(cyl_dummy = recode(cyl_dummy, # Target column 
                            "High" = "2", # Old - New
                            "Low" = "0",
                            "Median" = "1")) %>%
  pull(cyl_dummy)
##  [1] "1" "1" "0" "1" "2" "1" "2" "0" "0" "1" "1" "2" "2" "2" "2" "2" "2" "0" "0"
## [20] "0" "0" "2" "2" "2" "2" "0" "0" "0" "2" "1" "2" "0"

3.7 Counting

  • How may countries in each continent?
gapminder %>%
  count(continent)
  • Let’s arrange the result.
# Just add a new argument `sort = TRUE`
gapminder %>%
  count(continent, sort = TRUE)
# Same as above; How nice!
gapminder %>%
  count(continent) %>%
  arrange(desc(n))

Challenge

Count the number of observations per continent and year and arrange them in descending order.

Let’s take a deeper look at how things work under the hood.

  • tally() works similar to nrow(): Calculate the total number of cases in a dataframe

  • count = group_by() + tally()

gapminder %>%
  tally()
  • add_tally() = mutate(n = n())

Challenge

What does n in the below example represent?

gapminder %>%
  dplyr::select(continent, country) %>%
  add_tally()
  • add_count

Add count as a column.

# Add count as a column
gapminder %>%
  group_by(continent) %>%
  add_count(year)

Challenge

Do cases 1 and 2 in the below code chunk produce the same outputs? If so, why?

# Case 1
gapminder %>%
  group_by(continent, year) %>%
  count()
# Case 2
gapminder %>%
  group_by(continent) %>%
  count(year)

count() is a simple function, but it is still helpful to learn an essential concept underlying complex data wrangling: split-apply-combine strategy. For more information, read Wickham’s article (2011) “The Split-Apply-Combine Strategy for Data Analysis” published in the Journal of Statistical Software (especially pages 7-8). plyr was the package (retired) that demonstrated this idea, which has evolved into two directions: dplyr (for data frames) and purrr (for lists)

3.8 Summarizing

3.8.1 Basic

  • Create a summary
  • Think of summarise() as an extended version of count().
gapminder %>%
  group_by(continent) %>%
  summarise(
    n = n(),
    mean_gdp = mean(gdpPercap),
    sd_gdp = sd(gdpPercap)
  )
tablea <- gapminder %>%
  group_by(continent) %>%
  summarise(
    n = n(),
    mean_gdp = mean(gdpPercap),
    sd_gdp = sd(gdpPercap)
  )
  • Produce publishable tables
pacman::p_load(kableExtra,
               flextable)
# For HTML and LaTeX
tablea %>% kableExtra::kable()
continent n mean_gdp sd_gdp
Africa 624 2193.755 2827.930
Americas 300 7136.110 6396.764
Asia 396 7902.150 14045.373
Europe 360 14469.476 9355.213
Oceania 24 18621.609 6358.983
# For HTML and MS Office suite
tablea %>% flextable::flextable()

3.8.2 Scoped summaries

  • Old way

  • summarise_all()

# Create a wide-shaped data example
wide_gapminder <- gapminder %>%
  filter(continent == "Europe") %>%
  pivot_wider(
    names_from = country,
    values_from = gdpPercap
  )
# Apply summarise_all
wide_gapminder %>%
  dplyr::select(-c(1:4)) %>%
  summarise_all(mean, na.rm = TRUE)
  • summarise_if(): using a logical condition
wide_gapminder %>%
  summarise_if(is.double, mean, na.rm = TRUE)
  • summarise_at()

  • vars() = select()

wide_gapminder %>%
  summarise_at(vars(-c(1:4)),
    mean,
    na.rm = TRUE
  )
wide_gapminder %>%
  summarise_at(vars(contains("life")),
    mean,
    na.rm = TRUE
  )

Additional tips

Concept map for regular expressions. By Monica Alonso, Greg Wilson.

  • New way

  • summarise() + across()

Concept map for across. By Emma Vestesson

  • If you find using summarise_all(), summarise_if() and summarise_at() confusing, here’s a solution: use summarise() with across().

  • summarise_all()

wide_gapminder %>%
  summarise(across(Albania:`United Kingdom`, mean, na.rm = TRUE))
wide_gapminder %>%
  summarise(across(-c(1:4), mean, na.rm = TRUE))
  • summarise_if()
wide_gapminder %>%
  summarise(across(is.double, mean, na.rm = TRUE))
## Warning: Predicate functions must be wrapped in `where()`.
## 
##   # Bad
##   data %>% select(is.double)
## 
##   # Good
##   data %>% select(where(is.double))
## 
## ℹ Please update your code.
## This message is displayed once per session.
  • summarise_at()
wide_gapminder %>%
  summarise(across(-c(1:4),
    mean,
    na.rm = TRUE
  ))
wide_gapminder %>%
  summarise(across(contains("life"),
    mean,
    na.rm = TRUE
  ))
wide_gapminder %>%
  summarise(across(contains("A", ignore.case = FALSE)))

Note that this workshop does not cover creating and manipulating variables using mutate() because many techniques you learned from playing with summarise() can be directly applied to mutate().

Challenge

  1. Summarize the average GDP of countries whose names starting with the alphabet “A”.

  2. Turn the summary dataframe into a publishable table using either kableExtra or flextable package.

3.9 Grouping

3.9.1 Grouped summaries

  • Calculate the mean of gdpPercap.

  • Some functions are designed to work together. For instance, the group_by function defines the strata that you’re going to use for summary statistics. Then, use summarise() or summarize() for producing summary statistics.

gapminder %>%
  group_by(continent) %>% #
  summarise(mean_gdp = mean(gdpPercap))
  • Calculate multiple summary statistics.
gapminder %>%
  group_by(continent) %>% #
  summarise(
    mean_gdp = mean(gdpPercap),
    count = n()
  )

Optional

  • Other summary statistics
  1. Measures of spread: median(x), sd(x), IQR(x), mad(x) (the median absolute deviation)
# The Interquartile Range = The Difference Between 75t and 25t Percentiles
gapminder %>%
  group_by(continent) %>% #
  summarise(IQR_gdp = IQR(gdpPercap))
  1. Measures of rank: min(x), quantile(x, 0.25), max(x)
gapminder %>%
  group_by(continent) %>% #
  summarise(
    min_gdp = min(gdpPercap),
    max_gdp = max(gdpPercap)
  )
  1. Measures of position: first(x), last(x), nth(x, 2)
gapminder %>%
  group_by(continent) %>%
  summarise(
    first_gdp = first(gdpPercap),
    last_gdp = last(gdpPercap)
  )
gapminder %>%
  group_by(continent) %>%
  arrange(gdpPercap) %>% # Adding arrange
  summarise(
    first_gdp = first(gdpPercap),
    last_gdp = last(gdpPercap)
  )
  1. Measures of counts: n(x) (all rows), sum(!is.na(x)) (only non-missing rows) = n_distinct(x)
gapminder %>%
  group_by(continent) %>%
  summarise(ns = n())
  1. Counts and proportions of logical values: sum(condition about x) (the number of TRUEs in x), mean(condition about x) (the proportion of TRUEs in x)
gapminder %>%
  group_by(continent) %>%
  summarise(rich_countries = mean(gdpPercap > 20000))

Additional tips

Also, check out window functions such as cumsum() and lag(). Window functions are a variant of aggregate functions that take a vector as input then return a vector of the same length as an output.

vec <- c(1:10)
# Typical aggregate function
sum(vec) # The output length is one
## [1] 55
# Window function
cumsum(vec) # The output length is ten
##  [1]  1  3  6 10 15 21 28 36 45 55

3.10 Joining

Relational data = multiple tables of data

Relational data example

Key ideas

  • A primary key “uniquely identifies an observation in its table”
# Example
planes$tailnum %>% head()
## [1] "N10156" "N102UW" "N103US" "N104UW" "N10575" "N105UW"

Verify primary key

tailnum should be unique.

Challenge

What do you expect the outcome?

planes %>%
  count(tailnum) %>%
  filter(n > 1)

Optional

If a dataframe doesn’t have a primary key, you can add one called a surrogate key.

# Toy example
df <- tibble(
  x = c(1:3),
  y = c(4:6)
)
# Add a row_index column
df <- df %>% rowid_to_column("ID")
  • A foreign key “uniquely identifies an observation in another table.”
flights$tailnum %>% head()
## [1] "N14228" "N24211" "N619AA" "N804JB" "N668DN" "N39463"

For joining, don’t be distracted by other details and focus on KEYS!

3.10.1 Mutating joins

Add new variables to one data frame from matching observations in another" Using a simple toy example is great because it is easy to see how things work in that much narrow context.

  • Toy example
# Table 1
x <- tibble(
  key = c(1:4),
  val_x = c("x1", "x2", "x3", "x4")
)
# Table 2
y <- tibble(
  key = c(1:5),
  val_y = c("y1", "y2", "y3", "y4", "y5")
)
  • Inner Join

inner_join() keeps the matched values in both tables. If the left table is a subset of the right table, then left_join() is the same as inner_join().

Challenge

What is going to be the shared keys?

inner_join(x, y)
## Joining, by = "key"

Mutating joins

  • Left Join

left_join(), right_join() and full_join() are outer join functions. Unlike inner_join(), outer join functions keep observations that appear in at least one of the tables.

left_join() keeps only the matched observations in the right table.

left_join(x, y)
## Joining, by = "key"
  • Right Join

right_join() does the opposite.

right_join(x, y)
## Joining, by = "key"
  • Full Join

full_join() keeps the observations from both tables. If they were unmatched, then NAs were recoded in one of the two tables.

full_join(x, y)
## Joining, by = "key"

3.10.2 Filtering joins

Filter observations from one data frame based on whether they match an observation in the other table. - Semi Join

In SQL, this type of query is also called subqueries.

  • Filtering without joining
# Create the list of the top 10 destinations
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  top_n(10)
## Selecting by n
# Filter
filtered <- flights %>%
  filter(dest %in% top_dest$dest)
  • Using semi join: only keep (INCLUDE) the rows that were matched between the two tables
joined <- flights %>%
  semi_join(top_dest)
## Joining, by = "dest"
head(filtered == joined)
##      year month  day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## [1,] TRUE  TRUE TRUE     TRUE           TRUE      TRUE     TRUE           TRUE
## [2,] TRUE  TRUE TRUE     TRUE           TRUE      TRUE     TRUE           TRUE
## [3,] TRUE  TRUE TRUE     TRUE           TRUE      TRUE     TRUE           TRUE
## [4,] TRUE  TRUE TRUE     TRUE           TRUE      TRUE     TRUE           TRUE
## [5,] TRUE  TRUE TRUE     TRUE           TRUE      TRUE     TRUE           TRUE
## [6,] TRUE  TRUE TRUE     TRUE           TRUE      TRUE     TRUE           TRUE
##      arr_delay carrier flight tailnum origin dest air_time distance hour minute
## [1,]      TRUE    TRUE   TRUE    TRUE   TRUE TRUE     TRUE     TRUE TRUE   TRUE
## [2,]      TRUE    TRUE   TRUE    TRUE   TRUE TRUE     TRUE     TRUE TRUE   TRUE
## [3,]      TRUE    TRUE   TRUE    TRUE   TRUE TRUE     TRUE     TRUE TRUE   TRUE
## [4,]      TRUE    TRUE   TRUE    TRUE   TRUE TRUE     TRUE     TRUE TRUE   TRUE
## [5,]      TRUE    TRUE   TRUE    TRUE   TRUE TRUE     TRUE     TRUE TRUE   TRUE
## [6,]      TRUE    TRUE   TRUE    TRUE   TRUE TRUE     TRUE     TRUE TRUE   TRUE
##      time_hour
## [1,]      TRUE
## [2,]      TRUE
## [3,]      TRUE
## [4,]      TRUE
## [5,]      TRUE
## [6,]      TRUE
  • Anti Join

anti_join() does the opposite. Exclude the rows that were matched between the two tables. A great technique to filter stopwords when you do computational text analysis.

flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)

4 Modeling (broom)

4.1 Nesting

4.1.1 nest

The following example comes from R for Data Science by Garrett Grolemund and Hadley Wickham.

  • How can you run multiple models simultaneously? Using a nested data frame.

Hadley Wickham: Managing many models with R

  • Grouped data: each row = an observation

  • Nested data: each row = a group

Challenge

In the following example, why did we use country and continent for nesting variables?

nested <- gapminder %>%
  group_by(country, continent) %>%
  nest()
head(nested)
nested$data %>% pluck(1)
  • Custom function
lm_model <- function(df) {
  lm(lifeExp ~ year, data = df)
}
  • Apply function to the nested data
# Apply m_model to the nested data
nested <- nested %>%
  mutate(models = map(data, lm_model)) # Add the list object as a new column
head(nested)

S3 is part of R’s object-oriented systems. If you need further information, check out this section in Hadley’s Advanced R.

4.1.2 unnest

  • glance()

glance() function from broom package inspects the quality of a statistical model.

Additional tips

  • broom::glance(model): for evaluating model quality and/or complexity
  • broom::tidy(model): for extracting each coefficient in the model (the estimates + its variability)
  • broom::augment(model, data): for getting extra values (residuals, and influence statistics). A convenient tool in case if you want to plot fitted values and raw data together.

Broom: Converting Statistical Models to Tidy Data Frames by David Robinson

glanced <- nested %>%
  mutate(glance = map(models, broom::glance))
# Pluck the first item on the list 
glanced$glance %>% pluck(1)
# Pull p.value 
glanced$glance %>% pluck(1) %>% pull(p.value)
##        value 
## 9.835213e-08

unnest() unpacks the list objects stored in the glanced column

glanced %>%
  unnest(glance) %>%
  arrange(r.squared) 
glanced %>%
  unnest(glance) %>%
  ggplot(aes(continent, r.squared)) +
  geom_jitter(width = 0.5)

  • tidy()
nested <- gapminder %>%
  group_by(continent) %>%
  nest()
nested <- nested %>%
  mutate(models = map(data, ~lm(lifeExp ~ year + country, data = .))) 
tidied <- nested %>%
  mutate(tidied = map(models, broom::tidy))
model_out <- tidied %>%
  unnest(tidied) %>%
  mutate(term = str_replace(term, "country", "")) %>%
  select(continent, term, estimate, p.value) %>%
  mutate(p_threshold = ifelse(p.value < 0.05, 1, 0))
model_out %>% filter(p_threshold == 1) %>% pull(term) %>% unique()
##   [1] "(Intercept)"              "year"                    
##   [3] "Bahrain"                  "Bangladesh"              
##   [5] "Cambodia"                 "China"                   
##   [7] "Hong Kong, China"         "India"                   
##   [9] "Indonesia"                "Iran"                    
##  [11] "Iraq"                     "Israel"                  
##  [13] "Japan"                    "Jordan"                  
##  [15] "Korea, Dem. Rep."         "Korea, Rep."             
##  [17] "Kuwait"                   "Lebanon"                 
##  [19] "Malaysia"                 "Mongolia"                
##  [21] "Myanmar"                  "Nepal"                   
##  [23] "Oman"                     "Pakistan"                
##  [25] "Philippines"              "Saudi Arabia"            
##  [27] "Singapore"                "Sri Lanka"               
##  [29] "Syria"                    "Taiwan"                  
##  [31] "Thailand"                 "Vietnam"                 
##  [33] "West Bank and Gaza"       "Yemen, Rep."             
##  [35] "Austria"                  "Belgium"                 
##  [37] "Croatia"                  "Czech Republic"          
##  [39] "Denmark"                  "Finland"                 
##  [41] "France"                   "Germany"                 
##  [43] "Greece"                   "Iceland"                 
##  [45] "Ireland"                  "Italy"                   
##  [47] "Montenegro"               "Netherlands"             
##  [49] "Norway"                   "Poland"                  
##  [51] "Portugal"                 "Slovak Republic"         
##  [53] "Slovenia"                 "Spain"                   
##  [55] "Sweden"                   "Switzerland"             
##  [57] "Turkey"                   "United Kingdom"          
##  [59] "Angola"                   "Benin"                   
##  [61] "Botswana"                 "Burkina Faso"            
##  [63] "Burundi"                  "Cameroon"                
##  [65] "Central African Republic" "Chad"                    
##  [67] "Comoros"                  "Congo, Dem. Rep."        
##  [69] "Congo, Rep."              "Cote d'Ivoire"           
##  [71] "Djibouti"                 "Equatorial Guinea"       
##  [73] "Eritrea"                  "Ethiopia"                
##  [75] "Gabon"                    "Gambia"                  
##  [77] "Ghana"                    "Guinea"                  
##  [79] "Guinea-Bissau"            "Kenya"                   
##  [81] "Lesotho"                  "Liberia"                 
##  [83] "Madagascar"               "Malawi"                  
##  [85] "Mali"                     "Mauritania"              
##  [87] "Mauritius"                "Mozambique"              
##  [89] "Namibia"                  "Niger"                   
##  [91] "Nigeria"                  "Reunion"                 
##  [93] "Rwanda"                   "Senegal"                 
##  [95] "Sierra Leone"             "Somalia"                 
##  [97] "South Africa"             "Sudan"                   
##  [99] "Swaziland"                "Tanzania"                
## [101] "Togo"                     "Uganda"                  
## [103] "Zambia"                   "Zimbabwe"                
## [105] "Bolivia"                  "Brazil"                  
## [107] "Canada"                   "Colombia"                
## [109] "Dominican Republic"       "Ecuador"                 
## [111] "El Salvador"              "Guatemala"               
## [113] "Haiti"                    "Honduras"                
## [115] "Mexico"                   "Nicaragua"               
## [117] "Paraguay"                 "Peru"                    
## [119] "Puerto Rico"              "Trinidad and Tobago"     
## [121] "United States"            "Venezuela"               
## [123] "New Zealand"
model_out %>% filter(p_threshold == 0) %>% pull(term) %>% unique()
##  [1] "Bosnia and Herzegovina" "Bulgaria"               "Hungary"               
##  [4] "Romania"                "Serbia"                 "Egypt"                 
##  [7] "Libya"                  "Morocco"                "Sao Tome and Principe" 
## [10] "Tunisia"                "Chile"                  "Costa Rica"            
## [13] "Cuba"                   "Jamaica"                "Panama"                
## [16] "Uruguay"

4.1.3 Mapping

We tasted a little bit about how map() function works. Let’s dig into it more in-depth as this family of functions is useful. For more information, see Rebecca Barter’s excellent tutorial on the purrr package. In her words, this is “the tidyverse’s answer to apply functions for iteration”. map() function can take a vector (of any type), a list, and a dataframe for input.

multiply <- function(x) {
  x * x
}
df <- list(
  first_obs = rnorm(7, 1, sd = 1),
  second_obs = rnorm(7, 2, sd = 2)
) # normal distribution

Challenge

Try map_df(.x = df, .f = multiply) and tell me what’s the difference between the output you got and what you saw earlier.

If you want to know more about the power and joy of functional programming in R (e.g., purrr::map()), then please take “How to Automate Repeated Things in R” workshop.

5 Visualizing (ggplot2)

  • The following material is adapted from Kieran Healy’s excellent book (2019) on data visualization and Hadley Wickham’s equally excellent book on ggplot2. For more theoretical discussions, I recommend you to read The Grammar of Graphics by Leland Wilkinson.

  • Why should we care about data visualization? More precisely, why should we learn the grammar of statistical graphics?

  • Sometimes, pictures are better tools than words in 1) exploring, 2) understanding, and 3) explaining data.

5.1 Motivation

Anscombe’s quarter comprises four datasets, which are so alike in terms of their descriptive statistics but quite different when presented graphically.

# Set theme
theme_set(theme_minimal())
# Data
anscombe
# Correlation
cor(anscombe)[c(1:4), c(5:8)]
##            y1         y2         y3         y4
## x1  0.8164205  0.8162365  0.8162867 -0.3140467
## x2  0.8164205  0.8162365  0.8162867 -0.3140467
## x3  0.8164205  0.8162365  0.8162867 -0.3140467
## x4 -0.5290927 -0.7184365 -0.3446610  0.8165214
# gather and select
anscombe_processed <- anscombe %>%
  gather(x_name, x_value, x1:x4) %>%
  gather(y_name, y_value, y1:y4)
# plot
anscombe_processed %>%
  ggplot(aes(x = x_value, y = y_value)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  facet_grid(x_name ~ y_name) +
  theme_bw() +
  labs(
    x = "X values",
    y = "Y values",
    title = "Anscombe's quartet"
  )
## `geom_smooth()` using formula 'y ~ x'

5.2 The grammar of graphics

  • the grammar of graphics

    • data
    • aesthetic attributes (color, shape, size)
    • geometric objects (points, lines, bars)
    • stats (summary stats)
    • scales (map values in the data space)
    • coord (data coordinates)
    • facet (facetting specifications)

No worries about new terms. We’re going to learn them by actually plotting.

  • Workflow:

    1. Tidy data
    2. Mapping
    3. Geom
    4. Cor_ordinates and scales
    5. Labels and guides
    6. Themes
    7. Save files

5.3 mapping and geom

  • aes (aesthetic mappings or aesthetics) tells which variables (x, y) in your data should be represented by which visual elements (color, shape, size) in the plot.

  • geom_ tells the type of plot you are going to use

5.4 basic aes (x , y)

p <- ggplot(
  data = gapminder,
  mapping = aes(x = gdpPercap, y = lifeExp)
) # ggplot or R in general takes positional arguments too. So, you don't need to name data, mapping each time you use ggplot2.
p

p + geom_point()

p + geom_point() + geom_smooth() # geom_smooth has calculated a smoothed line;
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

# the shaded area is the standard error for the line

5.5 Univariate distribution

  • geom_histogram(): For the probability distribution of a continuous variable. Bins divide the entire range of values into a series of intervals (see the Wiki entry).
  • geom_density(): Also for the probability distribution of a continuous variable. It calculates a kernel density estimate of the underlying distribution.

5.5.1 Histogram

data(midwest) # load midwest dataset
midwest
midwest %>%
  ggplot(aes(x = area)) +
  geom_point() # not working.
midwest %>%
  ggplot(aes(x = area)) +
  geom_histogram() # stat_bin argument picks up 30 bins (or "bucket") by default.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

midwest %>%
  ggplot(aes(x = area)) +
  geom_histogram(bins = 10) # only 10 bins.

ggplot(
  data = subset(midwest, state %in% c("OH", "IN")),
  mapping = aes(x = percollege, fill = state)
) +
  geom_histogram(alpha = 0.7, bins = 20) +
  scale_fill_viridis_d()

5.5.2 Density

midwest %>%
  ggplot(aes(x = area, fill = state, color = state)) +
  geom_density(alpha = 0.3) +
  scale_color_viridis_d() +
  scale_fill_viridis_d()

5.6 Advanced aes (size, color)

  • There’s also fill argument (mostly used in geom_bar()). Color aes affects the appearance of lines and points, fill is for the filled areas of bars, polygons, and in some cases, the interior of a smoother’s standard error ribbon.

  • The property size/color/fill represents…

ggplot(
  data = gapminder,
  mapping = aes(
    x = gdpPercap, y = lifeExp,
    size = pop
  )
) +
  geom_point()

ggplot(
  data = gapminder,
  mapping = aes(
    x = gdpPercap, y = lifeExp,
    size = pop,
    color = continent
  )
) +
  geom_point() +
  scale_color_viridis_d()

# try red instead of "red"
ggplot(
  data = gapminder,
  mapping = aes(
    x = gdpPercap, y = lifeExp,
    size = pop,
    color = "red"
  )
) +
  geom_point()

Aesthetics also can be mapped per Geom.

p + geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

p + geom_point(alpha = 0.3) + # alpha controls transparency
  geom_smooth(color = "red", se = FALSE, size = 2)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

p + geom_point(alpha = 0.3) + # alpha controls transparency
  geom_smooth(color = "red", se = FALSE, size = 2, method = "lm")
## `geom_smooth()` using formula 'y ~ x'

ggplot(
  data = gapminder,
  mapping = aes(
    x = gdpPercap, y = lifeExp,
    color = continent
  )
) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", color = "red") +
  labs(
    x = "log GDP",
    y = "Life Expectancy",
    title = "A Gapminder Plot",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder"
  )
## `geom_smooth()` using formula 'y ~ x'

ggplot(
  data = gapminder,
  mapping = aes(
    x = gdpPercap, y = lifeExp,
    color = continent,
    fill = continent
  )
) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", color = "red") +
  labs(
    x = "log GDP",
    y = "Life Expectancy",
    title = "A Gapminder Plot",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder"
  ) +
  scale_color_viridis_d() +
  scale_fill_viridis_d()
## `geom_smooth()` using formula 'y ~ x'

5.7 Co-ordinates and scales

p + geom_point() +
  coord_flip() # coord_type

The data is heavily bunched up against the left side.

p + geom_point() # without scaling

p + geom_point() +
  scale_x_log10() # scales the axis of a plot to a log 10 basis

p + geom_point() +
  geom_smooth(method = "lm") +
  scale_x_log10()
## `geom_smooth()` using formula 'y ~ x'

5.8 Labels and guides

scales package has some useful premade formatting functions. You can either load scales or just grab the function you need from the library using scales::

p + geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", color = "red") +
  scale_x_log10(labels = scales::dollar) +
  labs(
    x = "log GDP",
    y = "Life Expectancy",
    title = "A Gapminder Plot",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder"
  )
## `geom_smooth()` using formula 'y ~ x'

  1. Themes
p + geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", color = "red") +
  scale_x_log10(labels = scales::dollar) +
  labs(
    x = "log GDP",
    y = "Life Expectancy",
    title = "A Gapminder Plot",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder"
  ) +
  theme_economist()
## `geom_smooth()` using formula 'y ~ x'

5.9 ggsave

figure_example <- p + geom_point(alpha = 0.3) +
  geom_smooth(method = "gam", color = "red") +
  scale_x_log10(labels = scales::dollar) +
  labs(
    x = "log GDP",
    y = "Life Expectancy",
    title = "A Gapminder Plot",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder"
  ) +
  theme_economist()
ggsave(figure_example, here("outputs", "figure_example.png"))

5.10 Many plots

Basic ideas:

  • Grouping: tell ggplot2 about the structure of your data
  • Facetting: break up your data into pieces for a plot

5.10.1 Grouping

  • Can you guess what’s wrong?
p <- ggplot(gapminder, aes(x = year, y = gdpPercap))
p + geom_point()

p + geom_line()

geom_line joins up all the lines for each particular year in the order they appear in the dataset. ggplot2 does not know the yearly observations in your data are grouped by country.

Note that you need grouping when the grouping information you need to tell is not built into the mapped variables (like continent).

gapminder

5.10.2 Facetting

Facetting is to make small multiples.

  • facet_wrap: based on a single categorical variable like facet_wrap(~single_categorical_variable). Your panels will be laid out in order and then wrapped into a grid.

  • facet_grid: when you want to cross-classify some data by two categorical variables like facet_grid(one_cat_variable ~ two_cat_variable).

p <- ggplot(gapminder, aes(x = year, y = gdpPercap))
p + geom_line(aes(group = country)) # group by, # The outlier is Kuwait.

p + geom_line(aes(group = country)) + facet_wrap(~continent) # facetting

p + geom_line(aes(group = country), color = "gray70") +
  geom_smooth(size = 1.1, method = "loess", se = FALSE) +
  scale_y_log10(labels = scales::dollar) +
  facet_wrap(~continent, ncol = 5) + # for single categorical variable; for multiple categorical variables use facet_grid()
  labs(
    x = "Year",
    y = "GDP per capita",
    title = "GDP per capita on Five continents"
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
## `geom_smooth()` using formula 'y ~ x'

p + geom_line(aes(group = country), color = "gray70") +
  geom_smooth(size = 1.1, method = "loess", se = FALSE) +
  scale_y_log10(labels = scales::dollar) +
  facet_grid(~continent) + # for single categorical variable; for multiple categorical variables use facet_grid()
  labs(
    x = "Year",
    y = "GDP per capita",
    title = "GDP per capita on Five continents"
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
## `geom_smooth()` using formula 'y ~ x'

5.10.3 Transforming

  • Transforming: perform some calculations on or summarize your data before producing the plot

5.10.4 Use pipes to summarize data

Also, we experiment bar charts here. By default, geom_bar uses stat = “bins”, which makes the height of each bar equal to the number of cases in each group. If you have a y column, then you should use stat = "identity" argument. Alternatively, you can use geom_col().

gapminder_formatted <- gapminder %>%
  group_by(continent, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  )
## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
ggplot(data = gapminder_formatted, aes(x = year, y = lifeExp_mean, color = continent)) +
  geom_point() +
  labs(
    x = "Year",
    y = "Life expectancy",
    title = "Life expectancy on Five continents"
  )

gapminder %>%
  filter(continent == "Europe") %>%
  group_by(country, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = year, y = lifeExp_mean, color = country)) +
  geom_point() +
  labs(
    x = "Year",
    y = "Life expectancy",
    title = "Life expectancy in Europe"
  )
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.

# geom point
gapminder %>%
  filter(continent == "Europe") %>%
  group_by(country, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = year, y = lifeExp_mean)) +
  geom_point() +
  labs(
    x = "Year",
    y = "Life expectancy",
    title = "Life expectancy in Europe"
  ) +
  facet_wrap(~country)
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.

# geom bar
gapminder %>%
  filter(continent == "Europe") %>%
  group_by(country, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = year, y = lifeExp_mean)) +
  geom_bar(stat = "identity") +
  labs(
    x = "Year",
    y = "Life expectancy",
    title = "Life expectancy in Europe"
  ) +
  facet_wrap(~country)
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.

# no facet
gapminder %>%
  filter(continent == "Europe") %>%
  group_by(country, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = year, y = lifeExp_mean, fill = country)) +
  geom_bar(stat = "identity") + # even if you not stack, still the plot looks messy or you can use geom_col()
  labs(
    x = "Year",
    y = "Life expectancy",
    title = "Life expectancy in Europe"
  )
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.

gapminder %>%
  filter(continent == "Europe") %>%
  group_by(country, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = country, y = lifeExp_mean)) +
  geom_boxplot() +
  labs(
    x = "Country",
    y = "Life expectancy",
    title = "Life expectancy in Europe"
  ) +
  coord_flip()
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.

# without ordering
gapminder %>%
  filter(continent == "Europe") %>%
  group_by(country, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = reorder(country, lifeExp_mean), y = lifeExp_mean)) +
  geom_boxplot() +
  labs(
    x = "Country",
    y = "Life expectancy",
    title = "Life expectancy in Europe"
  ) +
  coord_flip()
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.

# reorder
gapminder %>%
  filter(continent == "Europe") %>%
  group_by(country, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = reorder(country, -lifeExp_mean), y = lifeExp_mean)) +
  geom_boxplot() +
  labs(
    x = "Country",
    y = "Life expectancy",
    title = "Life expectancy in Europe"
  ) +
  coord_flip()
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.

5.10.5 Plotting text

gapminder %>%
  filter(continent == "Asia" | continent == "Americas") %>%
  group_by(continent, country) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = gdp_mean, y = lifeExp_mean)) +
  geom_point() +
  geom_text(aes(label = country)) +
  scale_x_log10() +
  facet_grid(~continent)
## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.

# with label
gapminder %>%
  filter(continent == "Asia" | continent == "Americas") %>%
  group_by(continent, country) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = gdp_mean, y = lifeExp_mean)) +
  geom_point() +
  geom_label(aes(label = country)) +
  scale_x_log10() +
  facet_grid(~continent)
## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.

# no overlaps
gapminder %>%
  filter(continent == "Asia" | continent == "Americas") %>%
  group_by(continent, country) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = gdp_mean, y = lifeExp_mean)) +
  geom_point() +
  geom_text_repel(aes(label = country)) + # there's also geom_label_repel
  scale_x_log10() +
  facet_grid(~continent)
## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

5.11 Ploting models

In plotting models, we extensively use David Robinson’s broom package in R. The idea is to transform model outputs (i.e., predictions and estimations) into tidy objects so that we can easily combine, separate, and visualize these elements.

5.11.1 Plotting several fits at the same time

model_colors <- RColorBrewer::brewer.pal(3, "Set1") # select three qualitatively different colors from a larger palette.
gapminder %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS")) +
  geom_smooth(
    method = "lm", formula = y ~ splines::bs(x, df = 3),
    aes(color = "Cubic Spline", fill = "Cubic Spline")
  ) +
  geom_smooth(method = "loess", aes(color = "LOESS", fill = "LOESS")) +
  theme(legend.position = "top") +
  scale_color_manual(name = "Models", values = model_colors) +
  scale_fill_manual(name = "Models", values = model_colors)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

5.11.2 Extracting model outcomes

# regression model
out <- lm(
  formula = lifeExp ~ gdpPercap + pop + continent,
  data = gapminder
)

tidy() is a method in the broom package. It “constructs a dataframe that summarizes the model’s statistical findings”. As the description states, tidy is a function that can be used for various models. For instance, a tidy can extract following information from a regression model.

  • Term: a term being estimated
  • p.value
  • statistic: a test statistic used to compute p-value
  • estimate
  • conf.low: the low end of a confidence interval
  • conf.high: the high end of a confidence interval
  • df: degrees of freedom

Challenge

Try glance(out), what did you get from these commands? If you’re curious, you can try ?glance.

The followings are to show your degree of confidence.

5.11.2.1 Coefficients

# estimates
out_comp <- tidy(out)
p <- out_comp %>%
  ggplot(aes(x = term, y = estimate))
p + geom_point() +
  coord_flip() +
  theme_bw()

5.11.2.2 Confidence intervals

# plus confidence intervals
out_conf <- tidy(out, conf.int = TRUE)
# plotting coefficients using ggplot2 (pointrange)
out_conf %>%
  ggplot(aes(x = reorder(term, estimate), y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() +
  coord_flip() +
  labs(x = "", y = "OLS Estimate") +
  theme_bw()

# another way to do it (errorbar)
out_conf %>%
  ggplot(aes(x = estimate, y = reorder(term, estimate))) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
  labs(y = "", x = "OLS Estimate") +
  theme_bw()

You can calculate marginal effects using the margins package. For the sake of time, I’m not covering that here.